Measurement error, or reliability, affects many common applications in statistics, such as
correlation, partial correlation, analysis of variance, regression, factor analysis, and others.
Despite its importance, the role of measurement error in these familiar statistical applications
often receives little or no attention in textbooks and courses on statistics. The purpose of this
article is to examine the role of reliability in familiar statistics and to show how ignoring the
consequences of (less than perfect) reliability in common statistical techniques can lead to false
conclusions and erroneous interpretation.

Keywords:  reliability;  statistics;  inference;  interpretation 

Measurement error is an integral part of measurement and is frequently indexed by reliability.
The reliability of a measure is the ratio of true variability to total variability. In simple
nontechnical language, reliability means precision. Most educational researchers and
psychologists learned about reliability in courses in psychometrics. Statistical techniques such
as descriptive statistics, regression, or analysis of variance are taught in separate courses. In
some instances, the two are combined, such as when Spearman’s true correlation is introduced;
this, however, is frequently the only time. More recently, courses covering meta-analytic
techniques frequently bridge that gap. The purpose of this article is to show the role of
reliability in familiar statistics and to show how ignoring the consequences of (less than
perfect) reliability in common statistical techniques can lead to false conclusions and erroneous
interpretation. Because of their widespread use in applied research and application, we will
illustrate the role of reliability with examples from descriptive statistics, z tests and t tests,
correlation, partial correlation, linear regression, test bias analysis, factor analysis, analysis
of variance, and analysis of covariance.


Authors’ Note: The opinions expressed are those of the authors and are not necessarily those of
the U.S. government, the Department of Defense, or the U.S. Air Force. Please address
correspondence to Malcolm James Ree, Our Lady of the Lake University, 411 SW 24th Street,
San Antonio, TX 78207-4689; e-mail: mree@satx.rr.com.

Measurement Error Model

The true score model is the most frequently used measurement error model (see Fuller, 1987).
In this model, the basic equation states that the observed score is equal to the true score plus
an error score. Furthermore, the error score is assumed to be random and therefore independent
of the true score. The true score and error score are not correlated. If the error score is not
random, the result is called bias and the consequences may be situationally specific. There are
measurement models for nonrandom error, but the current article is limited to random error. The
basic true score equation is O = t + e, or Observed = True + Error.

This yields¹

$$\sigma^2_{observed} = \sigma^2_{true} + \sigma^2_{error}. \quad (1)$$

By definition, reliability (Stanley, 1971) is the ratio of true score variance to observed score
variance, or reliability = $r_{xx'} = \sigma^2_{true}/\sigma^2_{obs}$. This is equivalent to
$r_{xx'} = 1 - (\sigma^2_{error}/\sigma^2_{obs})$, or reliability equals 1 minus the proportion
of error variance to observed variance. The effects of measurement error on familiar statistical
techniques can be determined from these equations. The purpose of the current effort is to
explain the consequences of the use of less than perfectly reliable measures in statistical
analyses.

The  Variable  Is  Not  Reliable 

It  is  not  unusual  to  hear  people  say,  “That  test  is  reliable”  or  “That  is  a  reliable  measure”  of 
some  construct.  However,  Thompson  (2003)  has  forcefully  made  the  point  that  a  test  (or 
other  measured  variable)  is  neither  reliable  nor  unreliable.  Reliability  concerns  the  scores  of 
the measure and is a consequence of the sample at hand. “It is important to evaluate score
reliability in all studies, because it is the reliability of the data in hand that will drive study
results,  and  not  the  reliability  of  the  scores  described  in  the  test  manual”  (Thompson,  2003, 
p.  5).  Your  sample  will  almost  surely  differ  from  the  normative  sample  reported  in  the  test 
manual.  It  may  differ  in  composition  by  gender,  ethnicity,  age,  experience,  education,  testing 
circumstances,  or  many  other  variables.  These  differences  will  cause  the  reliability  of  your 
sample  to  be  different  from  the  reliability  reported  in  the  manual,  and  your  results  will  be 
driven  by  the  reliability  of  your  sample. 

Descriptive  Statistics 

Mean 

The  effect  of  unreliability  on  the  mean  is  benign.  Because  the  error  score  is  random,  the 
mean  of  the  error  score  is  expected  to  be  zero.  Therefore,  the  expectation  of  the  observed 
mean  equals  the  true  mean.  The  bias  caused  by  measurement  error  on  the  observed  mean 
is  nil. 
Variance and Standard Deviation

The effects of measurement error on the variance and standard deviation are not so agreeable.
Returning to the true score equation, we see that the observed variance is the sum of the true
and error variances ($\sigma^2_{observed} = \sigma^2_{true} + \sigma^2_{error}$). Consequently,
the observed variance is greater when error increases. Observed score variance is always greater
than true score variance when the variable has been measured with less than perfect reliability.
If the true variance ($\sigma^2_{true}$) is 100 and the error variance ($\sigma^2_{error}$) is
10, the observed variance ($\sigma^2_{obs}$) will be 110. If the true variance remains 100 and
the error variance is 20, the observed variance will be 120. Note that the effect on the standard
deviations will appear to be less at 10.49 ($\sqrt{110}$) and 10.95 ($\sqrt{120}$). In these
cases, the reliability of the two scores would be .91 ($\sigma^2_{true}/\sigma^2_{obs}$ = 100/110)
and .83 ($\sigma^2_{true}/\sigma^2_{obs}$ = 100/120), respectively. The biasing influence on
effect size will be discussed in a subsequent section.
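The arithmetic in this example can be checked with a minimal Python sketch. The function names are illustrative and do not come from the article; the variances are the ones used in the text.

```python
import math


def reliability(true_var: float, error_var: float) -> float:
    """Ratio of true variance to observed (true + error) variance."""
    return true_var / (true_var + error_var)


def observed_sd(true_var: float, error_var: float) -> float:
    """Standard deviation of observed scores under the true score model."""
    return math.sqrt(true_var + error_var)


# True variance fixed at 100; error variance of 10 and then 20, as in the text.
for error_var in (10.0, 20.0):
    sd = observed_sd(100.0, error_var)
    rel = reliability(100.0, error_var)
    print(f"error variance {error_var:.0f}: SD = {sd:.2f}, reliability = {rel:.2f}")
```

Doubling the error variance from 10 to 20 moves the standard deviation only from about 10.49 to 10.95, while the reliability drops from about .91 to .83.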

The  z  Test  and  the  t  Test 

The basic form of the z test and the t test is a sample statistic minus a population parameter
in the numerator, divided by a standard error in the denominator. In the case of the z or t test
of a mean, the benign effect on the mean precludes changes in the numerator. The effect of
reliability is found in the denominator. The standard error (or estimated standard error in the
t test) is the standard deviation divided by the square root of n, the sample size.

Consider the two-sample (or independent-samples) two-tailed z test using a .05 Type I error
rate. With a difference between the means of 3.6, a sample size of 30, and a true standard
deviation (i.e., measured without error, $r_{xx'} = 1.0$) of 10, the computed z value would be
significant at 1.972. If the standard deviation were increased to 11 by unreliability
($r_{xx'} = .83$), the z test statistic would not be significant with a value of 1.793. If the
reliability were reduced further, the z value also would be reduced. For example, with the same
sample size and mean difference, but reliability reduced to $r_{xx'} = .625$ (observed standard
deviation = 16), the z value is 1.232 and would not be significant.
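The reported z values can be reproduced with a short Python sketch. It assumes the standard error described above (the standard deviation divided by the square root of n) and takes the three observed standard deviations stated in the text (10, 11, and 16) as inputs rather than deriving them from the reliabilities. The function name and the 1.96 critical value for a two-tailed .05 test are assumptions of the sketch.

```python
import math

CRITICAL_Z = 1.96  # two-tailed critical value at the .05 level


def z_statistic(mean_difference: float, sd: float, n: int) -> float:
    """z = mean difference divided by the standard error sd / sqrt(n)."""
    standard_error = sd / math.sqrt(n)
    return mean_difference / standard_error


# Same mean difference and sample size; only the observed SD changes.
for sd in (10.0, 11.0, 16.0):
    z = z_statistic(3.6, sd, 30)
    verdict = "significant" if z > CRITICAL_Z else "not significant"
    print(f"SD = {sd:.0f}: z = {z:.3f} ({verdict})")
```

The numerator never changes; the loss of significance comes entirely from the larger denominator.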

Confidence  Intervals 

Another way to evaluate the effects of unreliability is to look for differences in the width of
confidence intervals. Addition of error variance to true variance causes the confidence
intervals to widen. With a sample size of 30 and a true standard deviation of 10, when the
reliability is 1.0, the true standard error is 1.826. If the reliability were reduced to .830 or
to .625, the standard errors become 2.008 and 2.921, respectively. The confidence interval
becomes wider.
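As a rough illustration, the following Python sketch turns those standard errors into 95% confidence intervals around a sample mean. The mean of 50 is a hypothetical value chosen only so that there is an interval to display, the observed standard deviations of 10, 11, and 16 are the ones from the z test example, and the function names are illustrative.

```python
import math


def standard_error(sd: float, n: int) -> float:
    """Standard error of the mean: sd / sqrt(n)."""
    return sd / math.sqrt(n)


def confidence_interval(mean: float, sd: float, n: int, z: float = 1.96):
    """Normal-theory interval: mean plus or minus z standard errors."""
    half_width = z * standard_error(sd, n)
    return mean - half_width, mean + half_width


HYPOTHETICAL_MEAN = 50.0  # illustration only, not a value from the article

for sd in (10.0, 11.0, 16.0):
    low, high = confidence_interval(HYPOTHETICAL_MEAN, sd, 30)
    print(f"SD = {sd:.0f}: SE = {standard_error(sd, 30):.3f}, "
          f"95% CI = [{low:.2f}, {high:.2f}]")
```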

Effect  Size 

Less than perfect reliability also will have an influence on effect size
($(\mu_1 - \mu_2)/\sigma$) (see Baugh, 2002, for an insightful discussion of the issue). Russell
and Peterson (2002) reported the effect size for African American means versus White means on a
series of tests in a research project. They discuss a spatial test called Reasoning, which had
an effect size of .77. Russell and Peterson noted that many tests show an African American
versus White effect size of 1.0, and their experimental tests were on average less than 1.0. The
Reasoning test had a test-retest reliability of .65, and correcting the effect size for this
unreliability, the true effect size becomes .96, very close to the size reported frequently for
such differences. This change in effect size occurred because of the change in the estimate of
the standard deviation when unreliability was accounted for. A different conclusion about the
Reasoning test would have been reached had unreliability been taken into account.
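The text does not print the correction formula, but the reported value of .96 is reproduced by dividing the observed effect size by the square root of the reliability, which is what results from replacing the observed standard deviation with the true score standard deviation. The Python sketch below makes that assumption explicit; the function name is illustrative.

```python
import math


def correct_effect_size(observed_d: float, reliability: float) -> float:
    """Rescale a standardized mean difference to true score SD units.

    Assumes observed SD = true SD / sqrt(reliability), so the corrected
    effect size is observed_d / sqrt(reliability).
    """
    return observed_d / math.sqrt(reliability)


# Reasoning test: observed effect size .77, test-retest reliability .65.
print(round(correct_effect_size(0.77, 0.65), 2))
```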

Clearly, unreliability causes a reduction in statistical power, an artifactual widening of
confidence intervals, and a bias in estimating effect size. Ignoring the effects of
unreliability will lead to inappropriate conclusions and inferences about the tests and the
constructs being studied.


Correlation 

With the increased popularity of the meta-analytic technique of validity generalization
(Hunter & Schmidt, 1990, 2004), the correction for attenuation has become well known again, at
least by industrial/organizational psychologists (see Ree & Earles, 1993). Spearman (1904)
demonstrated that the correlation between the observed scores of two variables was a function
of the reliability of the two variables. The well-known formula that expresses this is

$$r_{xy} = r_c \sqrt{r_{xx'}} \sqrt{r_{yy'}}, \quad (2)$$

where $r_c$ is the estimate of the true correlation (sometimes written with a subscript
indicating true score), $r_{xy}$ is the observed attenuated correlation, and $r_{xx'}$ and
$r_{yy'}$ are the reliabilities of X and Y, respectively.

For example, if two measures of the same construct (true score correlation of 1.0) each have a
reliability of .8, the maximum correlation between the two ($r_{xy}$) is .8. If one of the
measures has a reliability of .6 and the other .8, the maximum observed² correlation would be
.69. Ignoring the consequences of reliability of the measures, the conclusion would be that
there is a moderate to strong correlation rather than the perfect correlation obtained at the
true score level. A practical consequence of this might be the search for new predictors to
close the (specious) gap between .69 and 1.0.

Observed correlations can be corrected for the unreliability of the variables by using an
algebraic manipulation of Equation 2 to yield

$$r_c = r_{xy} / (\sqrt{r_{xx'}} \sqrt{r_{yy'}}). \quad (3)$$

Consider an observed correlation of .72, where both variables X and Y have reliabilities of .8.
Using the equation above for correcting the correlation, the true correlation between X and Y
is .9. That is, $r_c = .72/(\sqrt{.8}\sqrt{.8}) = .9$.
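Both directions of the attenuation relationship can be sketched in a few lines of Python. The function names are illustrative; the inputs are the values used in the examples above.

```python
import math


def attenuate(true_r: float, rel_x: float, rel_y: float) -> float:
    """Observed correlation implied by a true correlation and two reliabilities."""
    return true_r * math.sqrt(rel_x) * math.sqrt(rel_y)


def disattenuate(observed_r: float, rel_x: float, rel_y: float) -> float:
    """Estimate of the true correlation from an observed correlation."""
    return observed_r / (math.sqrt(rel_x) * math.sqrt(rel_y))


# Ceiling on the observed correlation when the true correlation is 1.0.
print(round(attenuate(1.0, 0.8, 0.8), 2))
print(round(attenuate(1.0, 0.6, 0.8), 2))

# Correcting an observed correlation of .72 with reliabilities of .8 and .8.
print(round(disattenuate(0.72, 0.8, 0.8), 2))
```

Because the correction divides by a quantity smaller than 1, sampling error in the observed correlation is magnified by the same factor.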

Correlations  between  variables  that  change  from  low  or  moderate  to  moderate  or  high 
after  correction  for  (less  than  perfect)  reliability  suggest  that  the  variables’  utility  could  be 
improved  if  they  were  made  more  reliable.  In  addition,  low  to  moderate  correlations  that  do 
not  increase  in  magnitude  after  correction  for  unreliability  suggest  that  the  variables  contain 
other  sources  of  variance. 
Partial Correlation

Partial correlation is the correlation between two variables, X and Y, while holding a third
variable, Z, constant. Whether used for control or for selecting variables for stepwise
regression, the role of reliability in partial correlation can be large. Consider the following
example with three variables, X, Y, and Z, which measure the same construct with perfect
reliability. The true correlation between X and Y, X and Z, and Y and Z would be 1.0. The
partial correlation between any pair holding the other constant (i.e., partialing it out) would
be .0. If the reliability of all measures were .8, the partial correlation can be given as .44.

$$r_{xy.z} = \frac{r_{xy}(.8) - r_{xz}(.8)\, r_{yz}(.8)}{\sqrt{1 - (r_{xz} \times .8)^2}\,\sqrt{1 - (r_{yz} \times .8)^2}} = .44$$

Note that the value goes from no partial relationship (.0) to a moderate (.44) partial
relationship. This is a big difference and might have substantial implications for theory,
application, and policy. Caution is urged in interpretation.
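The .44 value can be verified with a minimal Python sketch that first attenuates each true correlation of 1.0 by the reliabilities of the two measures involved and then applies the usual first-order partial correlation formula. The function names are illustrative.

```python
import math


def observed_r(true_r: float, rel_a: float, rel_b: float) -> float:
    """Attenuate a true correlation by the reliabilities of both measures."""
    return true_r * math.sqrt(rel_a * rel_b)


def partial_correlation(r_xy: float, r_xz: float, r_yz: float) -> float:
    """Correlation of X and Y with Z partialed out."""
    numerator = r_xy - r_xz * r_yz
    denominator = math.sqrt(1.0 - r_xz ** 2) * math.sqrt(1.0 - r_yz ** 2)
    return numerator / denominator


# All three measures tap the same construct (true r = 1.0), each with reliability .8.
r = observed_r(1.0, 0.8, 0.8)  # every observed correlation equals .8
print(round(partial_correlation(r, r, r), 2))
```

With perfectly reliable measures the same function would divide zero by zero, which is why the text treats the true partial correlation as .0 by argument rather than by computation.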

If Z were a variable used for control by partialing it out, its reliability would be influential
in the estimation. For example, researchers partialed out age (Z in this example) to estimate
the true correlation between leg length (X) and running speed (Y). Suppose that age (Z), the
variable to be partialed out, had a reliability of .4 and the triad of correlations among X, Y,
and Z was truly 1.0. The observed partial correlation between leg length (X) and running speed
(Y) would be .29 rather than .0. The observed correlation of .29 is a poor estimate of the true
correlation, and the researcher would make erroneous conclusions about the relationship between
the variables.


Linear  Regression  Coefficients 

Simple  Linear  Regression  With  One  Predictor 

Consider a simple linear regression of Y on X. In the explanation of this regression, many
statistics texts contain a single line such as “it is assumed that all predictors are fixed
variables measured without error.” The role of measurement error in estimation of raw score
regression weights, b, is given by

$$b = \beta r_{xx'}, \quad (4)$$

and for the regression constant (or intercept), we have

$$a = \bar{Y} - (b / r_{xx'})\bar{X}. \quad (5)$$


In the case of a one-predictor regression, the effect is direct and easy to understand. The b
coefficient is biased toward zero, and the a coefficient is inflated. They are biased estimates
of the population parameters. Unreliability in the criterion has no biasing effect on the
regression coefficients; however, it does attenuate the correlation between the predictor and
criterion. There is a simple method to correct these biased estimates. The b coefficient is
divided by the reliability of the predictor variable, and this b coefficient is then placed in
Equation 5 for the intercept.

Suppose job performance criterion Y is regressed on test X, yielding the regression equation
Y = 2.0 + 1.6X and that the reliability of test X is .80. Correcting the b coefficient gives
(1.6/.8 = 2), and assuming means of 5 for the X variables and 10 for the Y variables, correcting
the intercept gives 10 - 2(5) = 0. The corrected regression equation is Y = 0 + 2X.
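The two-step correction can be written as a small Python function. This is a sketch of the procedure as described in the text, with an illustrative function name; it takes the observed slope, the predictor reliability, and the two means as inputs.

```python
def correct_simple_regression(b_observed: float, rel_x: float,
                              mean_x: float, mean_y: float):
    """Return (intercept, slope) corrected for predictor unreliability.

    Step 1: divide the observed slope by the reliability of X.
    Step 2: recompute the intercept from the means using the corrected slope.
    """
    b_corrected = b_observed / rel_x
    a_corrected = mean_y - b_corrected * mean_x
    return a_corrected, b_corrected


# Observed equation Y = 2.0 + 1.6X, reliability of X = .80, means of 5 and 10.
a, b = correct_simple_regression(1.6, 0.80, mean_x=5.0, mean_y=10.0)
print(f"corrected equation: Y = {a:.1f} + {b:.1f}X")
```

The observed intercept of 2.0 is not an input: it is already implied by the observed slope and the means (10 - 1.6 × 5 = 2.0).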

Multiple  Regression 

When there are multiple predictors, the effects on the regression coefficients become complex
and difficult to specify simply. The effect of reliability is a function of the reliability
magnitudes and the true score correlations among the predictors. Unreliability in the criterion
has no biasing effect on the regression coefficients; however, it does attenuate the multiple
correlations between the predictors and criterion. Aiken and West (1991) provided an
instructive example for the case of two independent variables, X and Z, used to predict the
criterion Y. In this case, the standardized regression weight being estimated is the partial
regression coefficient of Y on X holding out the effect of Z. The unreliability of the variable
being partialed out has a substantial effect on the partial regression coefficient of the other
variable. Even if one independent variable in a regression were measured with perfect
reliability, the unreliability of the other independent variables will have a biasing effect on
the regression coefficient associated with the independent variable measured without error. The
standardized regression coefficient is given by

$$b_{YX.Z} = (r_{YX} - r_{YZ} r_{XZ}) / (1 - r_{XZ}). \quad (6)$$

To correct this equation for unreliability of variable Z, it is necessary to write it as

$${}_{c}b_{YX.Z} = (r_{YX} r_{ZZ'} - r_{YZ} r_{XZ}) / (1 - r_{XZ}). \quad (7)$$

For example, if X is measured without error, the reliability of Z is .64, and
$r_{YX} = r_{YZ} = r_{XZ} = .5$, the corrected standardized coefficient is

$${}_{c}b_{YX.Z} = ((.5 \times .64) - (.5 \times .5)) / (1 - .5) = .07/.5 = .14.$$


The  two-variable  case  can  be  extended  to  the  case  of  many  independent  variables. 

Interpretation  of  Regression  Coefficients 

The failure to include reliability in the interpretation of the regression equation causes
problems in several ways depending on the use made of the regression equation and its
coefficients. The first is in the interpretation of the relative importance of the constructs
related to the predictors. Frequently, researchers compare weights and derive meaning of the
relative importance of the constructs represented by the observed variables such as verbal or
mathematical ability. The uncorrected regression weights are not dependable indicators of the
importance of the independent variables; therefore, interpretation of them can lead to
erroneous conclusions.

Consider an aptitude test with three equally reliable measures representing reading skill,
mathematics knowledge, and space perception. Furthermore, the source of validity is limited to
the common first factor (i.e., g) underlying each test in the battery and no validity, in this
example, is due to the specific measurement (i.e., s) of each test. Under these conditions, each
test should have the same regression weight when used in a regression equation to predict the
criterion. However, if there are differences in test reliabilities, the regression coefficients
will vary differentially from their true population values. Suppose these three example tests
have reliabilities of .65, .70, and .85, respectively. In estimation, the three regression
coefficients will differ because of their reliability. For example, if the three uncorrected
regression coefficients were .195, .210, and .255, some might interpret this to mean that space
perception is 1.3 times (.255/.195) as important as reading skill. In reality, the only
difference is in the reliability of the measures.
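A short Python sketch shows why the apparent 1.3-to-1 difference disappears. It applies the one-predictor correction (dividing each coefficient by the reliability of its own test) to each weight separately, which is a simplifying assumption: as noted under Multiple Regression, the general multi-predictor case also depends on the true score correlations among the predictors. The dictionary keys and function name are illustrative.

```python
reliabilities = {"reading": 0.65, "mathematics": 0.70, "space": 0.85}
uncorrected = {"reading": 0.195, "mathematics": 0.210, "space": 0.255}


def correct_weights(weights: dict, rels: dict) -> dict:
    """Divide each regression weight by the reliability of its own test."""
    return {name: weights[name] / rels[name] for name in weights}


corrected = correct_weights(uncorrected, reliabilities)

# Apparent relative importance before correction.
print(round(uncorrected["space"] / uncorrected["reading"], 1))

# After correction the three weights coincide.
for name, weight in corrected.items():
    print(f"{name}: {weight:.3f}")
```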

In  his  computer  programs  called  “Package,”  John  Hunter  (personal  communication, 
May  1,  1995)  has  a  regression  procedure  that  allows  for  explicit  correction  for  unreliability 
and  corrects  the  regression  coefficients. 

Even  if  the  reliability  of  the  measures  starts  out  the  same,  prior  selection  leads  to  reduction 
of  reliability  in  the  sample.  Prior  selection  refers  to  the  process  of  selecting  a  sample  using 
some  method,  such  as  minimum  qualification  scores,  that  changes  the  variability  of  the 
scores  in  that  sample.  Gulliksen  (1950,  1987,  p.  124,  Equation  5)  provides  the  following 
equation  to  show  the  relationship  between  prior  selection  and  reliability. 


$$R_{xx} = 1 - (s_x^2 / S_X^2)(1 - r_{xx}). \quad (8)$$

Consider  the  previous  example  with  the  three  tests  in  which  the  sample  has  been  selected 
on  the  basis  of  scores  on  the  reading  skill  test,  which  has  caused  indirect  selection 
(Thorndike,  1949)  to  occur  on  the  mathematics  and  space-perception  tests.  This  indirect 
selection is the result of the correlation of the variables. Given the same true regression
coefficients and reduction in variance of 50%, 30%, and 20%, respectively, for reading skill,
mathematics, and space perception, the reliabilities of the tests have changed differentially.
The regression coefficients thus become differentially biased and poor estimates of the
population values. Some would interpret these coefficients, and clearly, erroneous conclusions
would be drawn.
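To make the differential change concrete, the Python sketch below applies the relationship in Equation 8 under two assumptions that are not in the article: a common unrestricted reliability of .80 for all three tests (a hypothetical value), and error variance that is unchanged by selection, so that only true score variance is lost. The function and variable names are illustrative.

```python
def restricted_reliability(unrestricted_rel: float, variance_retained: float) -> float:
    """Reliability in a selected sample, assuming error variance is unchanged.

    variance_retained is the selected-sample variance divided by the
    unrestricted variance (e.g., 0.5 after a 50% reduction in variance).
    """
    return 1.0 - (1.0 - unrestricted_rel) / variance_retained


HYPOTHETICAL_REL = 0.80  # assumed common starting reliability

# Reductions in variance of 50%, 30%, and 20% from the example in the text.
variance_reduction = {"reading": 0.50, "mathematics": 0.30, "space": 0.20}

for test_name, reduction in variance_reduction.items():
    rel = restricted_reliability(HYPOTHETICAL_REL, 1.0 - reduction)
    print(f"{test_name}: reliability in selected sample = {rel:.3f}")
```

Under these assumptions, tests that started out equally reliable end up with reliabilities of roughly .60, .71, and .75, so their regression coefficients are attenuated by different amounts.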

Another use of regression coefficients is in production of individual job-specific regression
equations for personnel classification. Johnson and Zeidner (1991) have called for the use of
linear programming to achieve optimal assignment of individuals to jobs by such systems. When
the regression coefficients are computed in several range-restricted samples of job incumbents,
the prior selection of the job incumbents causes the reliabilities of the tests to vary from
sample to sample (Gulliksen, 1950, 1987, p. 124, Equation 5). These varying reliabilities cause
biases in the regression coefficients. In addition, the effect of the potential removal of
homoscedasticity because of range restriction induced by prior selection also biases the
regression coefficients. When samples are preselected and homoscedasticity is maintained, the
regression coefficient in the selected sample will not show bias due to heteroscedasticity
(Cohen & Cohen, 1983). The benefits of the use of these biased coefficients in optimization
(Johnson & Zeidner, 1991) may be illusory and due to nothing more than the reliability
artifact.

Any  technique  that  uses  regression  coefficients  such  as  clustering,  profile  analysis 
(Nunnally  &  Bernstein,  1994)  or  policy  capturing  (Ward  &  Jennings,  1973)  must  take  the 
unreliability  of  the  variables  into  account  or  inappropriate  inferences  will  be  made. 
Test Bias Detection

Jensen  (1980,  pp.  383-386)  and  others  (Cohen  &  Cohen,  1983;  Crocker  &  Algina,  1986; 
Fuller,  1987)  have  shown  that  less  than  perfect  reliability  can  influence  the  interpretation  of 
models  of  test  bias  that  rely  on  examination  of  regression  slope,  intercept,  and  standard  error 
of estimate. What may be mistakenly interpreted as test bias may in fact be due solely to
unreliability. As Jensen noted, “Before concluding that a test is intrinsically biased, it should be
determined  how  much  of  the  apparent  bias  is  attributable  to  the  unreliability  of  the  test”  (p. 
383). 

Test  unreliability  disadvantages  high-scoring  individuals,  regardless  of  their  group  (e.g., 
ethnicity/race, gender, socioeconomic) membership. Therefore, any group with proportionally
fewer high-scoring members will benefit (as a group) from a test’s unreliability. As noted
by  Jensen  (1980),  Hunter  and  Schmidt  (1976,  p.  1056)  suggested  that  test  unreliability  by 
itself  might  account  for  half  of  the  overprediction  of  grade  point  average  for  Blacks  reported 
in  the  literature. 

In  an  unbiased  test  with  perfect  reliability,  by  definition,  the  slope,  intercept,  and  standard 
error  of  estimate  are  the  same  for  the  groups  being  compared.  Through  several  illustrative 
examples,  Jensen  (1980)  showed  that  even  in  an  unbiased  test,  unreliability  reduces  the 
regression  slope,  produces  group  differences  in  the  Y  intercept,  and  increases  the  standard 
error  of  estimate. 

Regression  Slope 

In a perfectly reliable test, the observed slope will be $b_{YX}$. When reliability is less
than 1, the slope becomes $r_{xx} b_{YX}$. If the reliability of the test were zero, the
regression line would be horizontal (no slope). There is no group-difference effect of test
unreliability on the slope, unless the reliabilities differ in the two groups.

Regression  Intercept 

Interpretation of regression intercepts is hazardous when the predictor is not perfectly
reliable. Jensen (1980) showed that if the test’s reliability is less than perfect and there
are two groups and a single regression line, there must be two intercepts found solely because
of the unreliability of the predictors. The difference in intercepts for the two groups will
increase by an amount equal to $\Delta(k_A - k_B) = b_{YX}(1 - r_{xx})(\bar{X}_A - \bar{X}_B)$,
where $k_A$ and $k_B$ are the intercepts for groups A and B and $\bar{X}_A$ and $\bar{X}_B$ are
the means. Furthermore, $b_{YX}$ is the raw score regression coefficient for the regression of
Y on X, and $r_{xx}$ is the reliability of predictor X. The expected difference in intercepts
is a function of group means, regression coefficient, and predictor reliability. For example,
if the regression coefficient were 1 and $\bar{X}_A$ and $\bar{X}_B$ were 10 and 5 for a test
(X) with reliability of .9, the expected difference in intercepts would be 0.5. If the
reliability were decreased to .7, the expected difference in intercepts would increase to 1.5.
If the reliability decreased further to .5, the expected intercept difference would increase to
2.5. The nature and magnitude of the artifact is made clear when we contrast this to the
circumstance in which reliability is perfect and a zero difference in intercepts is found. The
uncritical interpretation of different intercepts as bias is unwarranted.
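The intercept artifact is a one-line computation. The Python sketch below evaluates the expression above for the reliabilities in the example; the function name is illustrative.

```python
def intercept_difference(b_yx: float, rel_x: float,
                         mean_a: float, mean_b: float) -> float:
    """Expected intercept difference produced solely by predictor unreliability."""
    return b_yx * (1.0 - rel_x) * (mean_a - mean_b)


# Regression coefficient of 1, group means of 10 and 5, decreasing reliability.
for rel in (1.0, 0.9, 0.7, 0.5):
    diff = intercept_difference(1.0, rel, mean_a=10.0, mean_b=5.0)
    print(f"reliability {rel:.1f}: expected intercept difference = {diff:.1f}")
```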
Standard Error of Estimate

Test  unreliability  increases  the  standard  error  of  estimate  (SEy)  by  an  amount  equal  to 
A SEr  =  -  (4  rxx  )i/l  -  rXY ).  Test  unreliability  increases  the  amount  of  overlap  of 

the distributions of the predicted criterion scores for the two groups being compared. Finally,
test unreliability decreases the standard deviation of the predicted criterion,
$\sigma_{\hat{Y}} = \sigma_Y \sqrt{r_{xx}}$, by an amount equal to
$\Delta\sigma_{\hat{Y}} = (1 - \sqrt{r_{xx}})\sigma_Y$.

An  Example  of  Corrected  Test  Bias  Detection  Analyses 

Carretta (1997) provided a practical example in a study of gender and ethnic group differences
in the predictive utility of aptitude composites used to select U.S. Air Force pilot trainees.
Uncorrected results showed group differences in predicted pilot training completion rates with
overestimation for the minority group (women = .07 and Hispanics = .12) relative to the
majority group (men and Whites). After correction for unreliability of the predictors, all
differences were reduced to a trivial .0004 or less.

Validity  Coefficient 

Test unreliability reduces the validity coefficient for both groups by an amount equal to
$\Delta r_{XY} = (1 - \sqrt{r_{xx}})\, r_{XY}$. In addition, test unreliability increases the
amount of overlap of the distributions of the predicted criterion scores for the two groups
being compared. Finally, test unreliability decreases the standard deviation of the predicted
criterion, $\sigma_{\hat{Y}} = \sigma_Y \sqrt{r_{xx}}$, by an amount equal to
$\Delta\sigma_{\hat{Y}} = (1 - \sqrt{r_{xx}})\sigma_Y$.

A  particularly  interesting  situation  occurs  in  the  tests  of  predictive  bias  (Cole,  1973)  using 
regression  models  (Lautenschlager  &  Mendoza,  1986).  Usually  the  first  test  of  models  in 
the  detection  of  bias  is  a  comparison  of  a  four-parameter  regression  model  against  a  two- 
parameter  model.  The  two  models  tested  are 

$$Y = a_1 + b_1 S + b_2 X + b_3 XS \quad (9)$$

and

$$Y = a_4 + b_4 X, \quad (10)$$

where X is a test score, S is a categorical variable (often called a dummy variable) of 0 and 1
denoting group membership, and XS is the cross-product of X and S. Note that XS has a peculiar
distribution, with zeros for the group coded 0 and test scores for the group coded 1. In the
first model, $a_1$ and $b_1$ are intercepts and $b_2$ and $b_3$ are slopes. In the simpler
model, $a_4$ is the intercept and $b_4$ is the slope. The first regression model can provide two
lines; the second regression model can provide only one line. Frequently, the two groups
considered have a mean score difference of $1\sigma$. Consequently, the test has a different
reliability for each group, and depending on the placement of the minimum cut score, the
reliability may be made to differ further between the groups after selection. The effects of
unreliability in the full model are more difficult to specify than in the reduced model, and
comparison of the models in the presence of measurement error may lead to inappropriate
inferences in the population.
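A small simulation can illustrate how measurement error alone produces a nonzero group term in the four-parameter model of Equation 9. The Python sketch below is not taken from the article and rests on several stated assumptions: true scores are normal with unit variance in both groups, the group means differ by one standard deviation, the criterion depends on the true score identically in both groups (slope 1, intercept 0), and the within-group reliability of X is the same in both groups. With a 0/1 dummy and its cross-product, fitting Equation 9 by least squares is equivalent to fitting a separate line in each group, which is what the code does. All function names and parameter values are illustrative.

```python
import random


def ols_line(xs, ys):
    """Return (intercept, slope) of the least-squares line of ys on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def simulate_group(rng, true_mean, reliability, n):
    """Simulate observed test scores X and criterion scores Y for one group."""
    # Error SD chosen so that true variance / observed variance = reliability.
    error_sd = ((1.0 - reliability) / reliability) ** 0.5
    xs, ys = [], []
    for _ in range(n):
        true_score = rng.gauss(true_mean, 1.0)
        xs.append(true_score + rng.gauss(0.0, error_sd))
        ys.append(true_score + rng.gauss(0.0, 0.5))  # same true relation in both groups
    return xs, ys


def fit_full_model(reliability, n=20000, seed=0):
    """Estimate a1, b1, b2, b3 of Equation 9 from two simulated groups."""
    rng = random.Random(seed)
    x0, y0 = simulate_group(rng, 0.0, reliability, n)  # group coded S = 0
    x1, y1 = simulate_group(rng, 1.0, reliability, n)  # group coded S = 1
    a0, slope0 = ols_line(x0, y0)
    a1, slope1 = ols_line(x1, y1)
    return {"a1": a0, "b1": a1 - a0, "b2": slope0, "b3": slope1 - slope0}


for rel in (1.0, 0.7):
    estimates = fit_full_model(rel)
    print(rel, {name: round(value, 2) for name, value in estimates.items()})
```

Under these assumptions the group intercept term b1 is close to zero when X is perfectly reliable and close to (1 - .7) × 1 = .3 when the reliability is .7, even though the criterion was generated without any group effect.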

Factor  Analysis 

The  role  of  reliability  in  factor  analysis  is  well  known  and  generally  straightforward.  The 
general model of factor analysis is that the variance of the observed variable is a linear
combination of common factors and unique factors. The ratio of the variance associated with the
common  factor  to  the  total  variance  of  an  observed  variable  is  known  as  the  communality  of 
the  observed  variable  (Fuller,  1987,  pp.  60-61).  Using  Fuller’s  notation,  communality  can  be 
written  as 


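$$\kappa_{ii} = [\beta_{i1}^2 \sigma_{xx} + \sigma_{eeii}]^{-1} \beta_{i1}^2 \sigma_{xx} = 1 - \sigma_{ZZii}^{-1} \sigma_{eeii}. \quad (11)$$

This quantity, $\kappa_{ii}$, the communality, is an estimate of the reliability of the variable. The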
communality  of  a  variable  provides  a  lower  bound  estimate  of  its  reliability  (Baggaley, 
1964).  It  is  a  lower  bound  estimate  because  it  does  not  include  the  reliable  variance  measured 
by  specific  factors.  The  unique  variance  or  uniqueness  of  a  variable  is  (1  -  communality).  The 
unique  factors  are  composed  of  specific  variance  and  error  variance.  Symbolically,  these 
relationships  can  be  expressed  as 


$$X = h + u \tag{12}$$

or

$$X = h + s + e, \tag{13}$$

where X is an observed variable, h is the communality, u is the uniqueness, s is the specificity,
and e is the error.
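
A small numeric sketch may make the lower-bound relationship concrete. It assumes that the common, specific, and error parts are mutually uncorrelated, so that their variances add, and it uses hypothetical variance components (.50, .20, .30) chosen only for illustration; the function name `variance_shares` is likewise ours.

```python
def variance_shares(common, specific, error):
    """Split a variable's total variance into the shares discussed in the text.

    Assumes the common, specific, and error parts are mutually uncorrelated,
    so their variances sum to the total variance.
    """
    total = common + specific + error
    return {
        "communality": common / total,               # common variance only
        "uniqueness": (specific + error) / total,    # 1 - communality
        "reliability": (common + specific) / total,  # all non-error variance
    }


# Hypothetical variance components, for illustration only.
shares = variance_shares(common=0.50, specific=0.20, error=0.30)
for name, value in shares.items():
    print(f"{name}: {value:.2f}")
```

With these illustrative values, the communality falls short of the reliability by exactly the share of specific variance, which is why communality serves only as a lower bound.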

If the variable is associated with the factor, then as the reliability of the observed variable
increases and the error decreases, the loadings of the variable on the factors can be expected
to increase. For example, if three variables have equal true loadings but are measured with
differing reliability, the observed loadings will differ as a function of the reliability, with
the more reliably measured variables receiving higher loadings.3 Interpretations of these observed
loadings will lead to erroneous conclusions about the factorial causation of the variables because
the differences are due to differing reliabilities and not differing relationships to the factor.
To correct factor loadings for unreliability, the loadings for the observed variables can be
divided by the square root of the reliability of the observed variables, as in the correction of a
correlation for attenuation. These corrected loadings give more appropriate estimates of the true
relationships of the factors to the observed variables.
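
The three-variable example can be sketched as a small simulation. Several assumptions here are ours: the factor scores are treated as known rather than estimated, every variable has a true loading of .80, the reliabilities are .90, .70, and .50, and the correction applied is the classical correction for attenuation, in which a loading is divided by the square root of the variable's reliability.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


rng = random.Random(1)
n = 20000
true_loading = 0.80                # same true loading for all three variables
reliabilities = [0.90, 0.70, 0.50]  # illustrative values

factor = [rng.gauss(0.0, 1.0) for _ in range(n)]

observed_loadings = []
for rel in reliabilities:
    # True scores have unit variance; error variance follows from the reliability.
    error_sd = math.sqrt((1 - rel) / rel)
    residual_sd = math.sqrt(1 - true_loading ** 2)
    scores = []
    for f in factor:
        true_score = true_loading * f + residual_sd * rng.gauss(0.0, 1.0)
        scores.append(true_score + rng.gauss(0.0, error_sd))
    observed_loadings.append(pearson(scores, factor))

# Classical correction for attenuation: divide by the square root of reliability.
corrected_loadings = [
    loading / math.sqrt(rel) for loading, rel in zip(observed_loadings, reliabilities)
]

for rel, obs, cor in zip(reliabilities, observed_loadings, corrected_loadings):
    print(f"reliability {rel:.2f}: observed {obs:.3f}, corrected {cor:.3f}")
```

In this sketch the observed loadings fall as reliability falls, although the true loadings are identical, and the corrected loadings return to values near the common true loading.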

Ree and Carretta (1998) reported a study that showed the correlation between the unrotated
first-factor loadings of multiple aptitude battery4 scores and average validity of those scores.
The correlation was .76. The factor loadings were then corrected for unreliability, and the
correlation became .98.

Analysis of Variance (ANOVA) and Analysis of Covariance (ANCOVA)

ANOVA  and  ANCOVA  are  examples  of  the  linear  model  as  is  regression  analysis.  The 
effects  of  measurement  error  are  similar;  however,  the  independent  variables  in  ANOVA  are 
usually  uncorrelated  owing  to  random  assignment  of  participants.  As  stated  above,  the 
effects  of  unreliability  on  uncorrelated  independent  variables  are  simpler. 

ANOVA 

Let us consider a one-way ANOVA with three levels of the independent variable with means $\mu_1$,
$\mu_2$, and $\mu_3$. Remembering that ANOVA is a linear model and that the parameter estimates can
be found by means of regression, we note that $\mu_1 = \alpha + \beta_1$, $\mu_2 = \alpha + \beta_2$,
and $\mu_3 = \alpha$, where $\alpha$ is the regression additive constant (intercept) and $\beta_1$
and $\beta_2$ are the multiplicative partial regression coefficients for the two categorical
variables needed to represent the three levels of the independent variable. Furthermore, note that
$\alpha = \mu_3$, $\beta_1 = \mu_1 - \mu_3$, and $\beta_2 = \mu_2 - \mu_3$. Suppose $\mu_1 = +1$,
$\mu_2 = 0$, and $\mu_3 = -1$ and that the reliabilities $r_{XX_1} = r_{XX_2} = r_{XX_3} = .50$. The
true differences are 1 or 2 points, but there is a loss of statistical power. In addition, the
effect size (e.g., $(\mu_1 - \mu_2)/\sigma$) may be substantially underestimated because $\sigma$ is
inflated by error variance. Much the same may be found in an N-way ANOVA. Consider a two-way ANOVA
with the independent variables of gender and political party affiliation. There are two gender
levels (male/female) and three political affiliation levels (Democrat, Independent, and
Republican). Using the same logic as before, the group means may be represented as follows:


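Gender    Political Affiliation    Group Means
Female    Democrat                 $\mu_{f1} = \alpha + \beta_1 + \beta_3$
Female    Independent              $\mu_{f2} = \alpha + \beta_2 + \beta_3$
Female    Republican               $\mu_{f3} = \alpha + \beta_3$
Male      Democrat                 $\mu_{m1} = \alpha + \beta_1$
Male      Independent              $\mu_{m2} = \alpha + \beta_2$
Male      Republican               $\mu_{m3} = \alpha$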

Again, both statistical power and effect size may be reduced. If the two independent variables
above are correlated, as they well may be given the impossibility of randomly assigning gender and
party affiliation, the same biases will be found as in a multiple regression with correlated
predictors. With less than perfectly reliable variables, the results can be very misleading.
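
The underestimation of effect size in the one-way example (means of +1, 0, and -1, reliabilities of .50) can be sketched numerically. The sketch assumes classical test theory, under which observed variance equals true variance divided by the reliability, and it assumes a within-group standard deviation of 1 for the true scores, a value the text does not give. The function name `observed_effect_size` is illustrative.

```python
import math


def observed_effect_size(mean_difference, true_sd, reliability):
    """Standardized mean difference when scores contain measurement error.

    Under classical test theory, observed variance = true variance / reliability,
    so the observed standard deviation exceeds the true-score standard deviation.
    """
    observed_sd = true_sd / math.sqrt(reliability)
    return mean_difference / observed_sd


means = {1: 1.0, 2: 0.0, 3: -1.0}  # group means from the example in the text
true_sd = 1.0                      # assumed within-group SD of true scores
reliability = 0.50                 # reliability from the example in the text

d_true_12 = (means[1] - means[2]) / true_sd
d_obs_12 = observed_effect_size(means[1] - means[2], true_sd, reliability)
d_true_13 = (means[1] - means[3]) / true_sd
d_obs_13 = observed_effect_size(means[1] - means[3], true_sd, reliability)

print(f"groups 1 vs 2: true d = {d_true_12:.2f}, observed d = {d_obs_12:.2f}")
print(f"groups 1 vs 3: true d = {d_true_13:.2f}, observed d = {d_obs_13:.2f}")
```

With a reliability of .50, each standardized difference shrinks by a factor of the square root of .50, or about .71, so a true effect of 1.00 appears as roughly 0.71.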

An instructive example is provided in the work of Guttman (2000) in a study of 16- to 40-year-old
females. Independent variables for the analysis of variance were based on meeting the criteria in
the Diagnostic and Statistical Manual of Mental Disorders (3rd ed., revised; American Psychiatric
Association, 1987) for the clinical conditions of anorexia nervosa and borderline personality
disorder. The control participants were admitted to the study on three less than perfectly
reliable self-report clinical instruments. Of particular interest were the dependent variables
assessed by a 28-item measure of an individual’s cognitive and emotional capacity for empathy.
This instrument yielded four scales whose median reliability was reported by Guttman to be about
.70. The DSM-III-R assessments have less than perfect reliability, and .6 is a reasonable
approximation of the reliability of the categorization into the anorexic and borderline
personality groups. Given the magnitudes of less than perfect reliability in both the independent
and dependent variables, it is likely that all effect sizes were underestimated and many
significant differences were undetected.

ANCOVA 

ANCOVA is a linear model with categorical variables and at least one continuous variable as a
covariate. For example, suppose we were interested in examining the effects of political party
affiliation, a three-level categorical variable (Democrat, Independent, and Republican), and
annual income measured in dollars earned (a continuous variable) on amount of support for the
president’s proposed budget (a continuous variable). The three levels of party affiliation are
represented by two categorical independent variables ($X_1$ and $X_2$). Income level ($X_3$) is the
continuous covariate independent variable. The following linear model represents the relationship
of party affiliation and income to the dependent variable, support for the president’s budget
($Y$):

$$Y = \alpha + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3. \tag{14}$$

Considering less than perfect reliability of the independent variables, the equation can be
rewritten as

$$Y = \alpha' + r_{11}\beta_1 X_1 + r_{22}\beta_2 X_2 + r_{33}\beta_3 X_3, \tag{15}$$

where $r_{11}$, $r_{22}$, and $r_{33}$ are reliabilities and $\alpha'$ denotes the additive
regression coefficient affected by the unreliability of the independent variables.
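
A short derivation, under classical test theory assumptions that the text does not state explicitly, suggests why a reliability multiplies each coefficient. Assume a single continuous predictor whose observed score is $X = T + E$, where the error $E$ is uncorrelated with the true score $T$, and assume $Y = \alpha + \beta T + \varepsilon$ with $\varepsilon$ uncorrelated with both $T$ and $E$. The symbols $T$, $E$, and $\varepsilon$ are introduced here only for illustration.

```latex
\begin{align*}
b_{YX} &= \frac{\operatorname{Cov}(X, Y)}{\operatorname{Var}(X)}
        = \frac{\operatorname{Cov}(T + E,\ \alpha + \beta T + \varepsilon)}
               {\operatorname{Var}(T) + \operatorname{Var}(E)} \\
       &= \beta \, \frac{\operatorname{Var}(T)}
                        {\operatorname{Var}(T) + \operatorname{Var}(E)}
        = r_{XX} \, \beta .
\end{align*}
```

The observed slope is thus the true slope multiplied by the reliability of the predictor. With several predictors, this simple form holds only when the predictors are uncorrelated; when they are correlated, the bias in each coefficient also depends on the reliabilities and intercorrelations of the other predictors.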

The effects of less than perfect reliability will again be found, and the higher the correlation
between the covariate $X_3$ and the independent variables $X_1$ and $X_2$, the more bias will be
noted in the analysis and the greater the loss of statistical power. The same biases will be found
as would be found in a multiple regression with correlated predictors.

Ameliorating  or  Correcting  the  Problem 

In each of the cases we reviewed, it has been shown that using less than perfectly reliable
variables creates bias in the parameter estimates. This reduces statistical power and provides the
opportunity for misinterpretation of findings and misstatement of fundamental relationships.

There are several ways to ameliorate or correct this problem. The first approach is to use
variables that yield highly reliable scores for your sample (see Ree, Carretta, & Steindl, 2001).
Revising unreliable test items or observational techniques, adding test items or observations,
revising vague or confusing instructions, and clarifying ambiguous scoring and coding procedures
can accomplish this. This alleviates most of the problems but does not entirely remove the bias
due to unreliability. A second approach is to correct the observed variables for the effects of
unreliability and conduct the analyses on the corrected values. This can be accomplished with
reliability estimates from the participant sample in the study. Finally, the use of latent
variable analyses, such as confirmatory factor analyses or structural equation modeling, which
eliminate or substantially reduce the unreliability of the variables, is a third worthwhile
approach.
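
Two standard formulas from classical test theory illustrate the first two approaches. Neither is given in this section, so the sketch below is an illustration under our own assumptions rather than the authors' procedure: the Spearman-Brown formula projects the reliability gained by adding parallel items, and the correction for attenuation estimates a true-score correlation from an observed one. The input values are hypothetical.

```python
import math


def spearman_brown(reliability, length_factor):
    """Projected reliability when a test is lengthened by `length_factor`
    using parallel items (Spearman-Brown formula)."""
    return (length_factor * reliability) / (1 + (length_factor - 1) * reliability)


def correct_for_attenuation(r_xy, r_xx, r_yy):
    """Estimated correlation between true scores, given the observed
    correlation and the reliabilities of the two measures."""
    return r_xy / math.sqrt(r_xx * r_yy)


# Hypothetical values, for illustration only.
doubled_reliability = spearman_brown(reliability=0.60, length_factor=2)
true_score_r = correct_for_attenuation(r_xy=0.30, r_xx=0.70, r_yy=0.80)

print(f"reliability after doubling test length: {doubled_reliability:.2f}")
print(f"correlation corrected for attenuation: {true_score_r:.2f}")
```

In this illustration, doubling a test with a reliability of .60 projects a reliability of .75, and an observed correlation of .30 between measures with reliabilities of .70 and .80 corresponds to an estimated true-score correlation of about .40.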

Cohen and Cohen (1983, p. 411) reported that Dunivant (1981) conducted simulation studies to
evaluate the last two approaches and concluded that both “have merit and yield reasonable
results.” Unreliability poses a threat to our knowledge and practice, whether in theoretical
studies or in practical application. Baugh (2002) expressed it well, stating, “As the winds of
change continue to shape responsible research practice, it is hoped that researchers will give
more thoughtful consideration to the influence that measurement error variance exerts” (p. 261).


Notes 

1. We follow the convention of using Greek letters for population parameters and Roman letters for
sample statistical estimates of population parameters. Equation 1 is the single exception to this
rule as many are familiar with the equation when written as presented.

2. Due to sampling error, the observed correlation could take on numerous higher values. We present
the maximum expected observed correlation.

3.  We  noted  a  similar  finding  for  regression  coefficients  in  the  section  on  multiple  regression.  In  factor  analyses 
of  test  items  or  questionnaire  items,  it  may  be  difficult  to  estimate  the  reliability  of  items. 

4.  The  unrotated  first  factor  of  a  multiple  aptitude  battery  is  a  measure  of  general  cognitive  ability  (g). 